Abstract

The sparse representation classifier (SRC) proposed in [1] has recently gained much attention from
the machine learning community. It makes use of $\ell_1$ minimization and is known to work well for
data satisfying a subspace assumption. In this paper, we use the notion of class dominance as well
as a principal angle condition to investigate and validate the classification performance of SRC,
without relying on $\ell_1$ minimization and the subspace assumption. We prove that SRC can still
work well using faster subset regression methods such as orthogonal matching pursuit and marginal
regression, and that its applicability is not limited to data satisfying the subspace assumption. We
illustrate our theorems via various real data sets including face images, text features, and network
data.

Keywords: sparse representation classification, $\ell_1$ minimization, orthogonal matching pursuit,
marginal regression, class dominance, principal angle


1. Introduction 

Recently there has been a surge in the use of sparse representation and regularized regression for
many machine learning tasks in computer vision and pattern recognition. Applications include [2],
among many others. In this paper, we concentrate on one specific but profound
application, the sparse representation classification (SRC), which was proposed in [1] and exhibits
state-of-the-art performance for robust face recognition.

For the classification task, denote $x \in \mathbb{R}^m$ as the testing observation and $X \in \mathbb{R}^{m \times n}$ as the
matrix of training data with all columns pre-scaled to unit norm. Each column of $X$ is denoted
as $x_i$ for $i = 1, \ldots, n$, representing a training observation with a known class label $y_i \in [1, \ldots, K]$.
The sparse representation classifier consists of two steps: for each testing observation $x$, first it
finds a sparse representation $\beta$ such that $x = X\beta + \epsilon$; then the class of the testing observation is
determined by $g(x) = \arg\min_{k=1,\ldots,K} \|x - X\beta_k\|_2$, where $g(\cdot) : \mathbb{R}^m \to \{1, \ldots, K\}$ is the classifier,
and $\beta_k$ takes the values from $\beta$ that are associated with data of class $k$, i.e., $\beta_k(i) = \beta(i)$ if $y_i = k$, and $0$
otherwise. Denoting the true but unknown class of $x$ as $y \in [1, \ldots, K]$, SRC correctly finds the true
label if $g(x) = y$. This classifier has been numerically shown to work well and be robust against
occlusion and contamination on face images, and argued to be better than nearest-neighbor and
nearest-subspace rules in [1].

Clearly, finding an appropriate sparse representation is the crucial step of SRC, which is intrinsically
subset regression, i.e., apply a certain method to select a subset of data $X_s \subset X$, and then
take the corresponding regression vector as the sparse representation $\beta$. Most works on sparse
representation have used regularized regression methods to achieve sparsity, for which
$\ell_1$ minimization and Lasso are very popular choices due to their theoretical justifications [7],
among others. The literature on $\ell_1$ minimization and Lasso is
abundant, and usually emphasizes how the $\ell_1$ regularization can help recover the sparsest
model. But how the regularization may help the subsequent inference is usually a difficult question
to answer in practice, and the role of $\ell_1$ minimization is not entirely clear for this particular
classification task.

The initial motivation for [1] to use $\ell_1$ minimization is its equivalence to $\ell_0$ minimization (i.e.,
sparse model recovery) under various conditions, such as the incoherence condition [5] or the restricted
isometry property. Namely, if the testing observation $x$ does have a unique and correct sparsest
representation $\beta$ (correct in the sense of $g(x) = y$) with respect to the training data, then
assuming proper conditions are satisfied, $\ell_1$ minimization is an ideal choice in the SRC framework.
But the sample training data are usually correlated, which violates many theoretical conditions
including incoherence and restricted isometry; furthermore, the classification task requires only the
recovered $\beta$ to be mostly associated with data of the correct class rather than one uniquely correct
solution, so there usually exist infinitely many correct $\beta$ that are not the sparsest solution.

In this direction, it is argued in [1] that if data of the same class lie in the same subspace
while data of different classes lie in different subspaces (called the subspace assumption henceforth),
then most data selected by $\ell_1$ minimization should be from the correct class, thus yielding good
classification performance in SRC. Since face image data under varying lighting and expression
roughly satisfy the subspace assumption [18], [19], they further argue that SRC is applicable to
face images. Indeed, based on the subspace assumption, [5] derives a theoretical condition for $\ell_1$
minimization to do perfect variable selection, i.e., all selected training data are from the correct
class. This indicates that sparse representation is a valuable tool with $\ell_1$ minimization under the
subspace assumption. But the subspace assumption assumes a low-dimensional structure for data
of the same class, which does not always hold and is difficult to validate in practice.

Despite many applications of and investigations into sparse representation, the intrinsic properties
and mechanisms of SRC are still not well understood, and there exists evidence [20], [21],
[22], [23] that neither $\ell_1$ minimization nor the subspace assumption is necessary in the SRC
framework. In particular, [21] and [23] argue that it is actually the classification step (namely
$g(x) = \arg\min_{k=1,\ldots,K} \|x - X\beta_k\|_2$) that is most effective in SRC; they call it the collaborative
representation, and support their claims through many numerical examples. Our previous work
applies SRC to vertex classification [53], which also achieves good performance for network data
without $\ell_1$ minimization or the subspace assumption.

To deepen our understanding, in this paper we target two important questions related to SRC.
First, is the subspace assumption a necessity for SRC to perform well? And if not, when and
how is SRC applicable with theoretical performance guarantees? Second, despite the popularity
of $\ell_1$ minimization, is this the optimal approach to do variable selection for SRC? Can we use
other faster subset regression methods such as orthogonal matching pursuit (OMP) [25], [26] and
marginal regression?

With these two target questions in mind, this paper is organized as follows. In Section 2 we
review the SRC framework and three subset regression methods including $\ell_1$ homotopy, OMP, and
marginal regression. Section 3 is the main section. In subsection 3.1 we first relate SRC to a notion
we call class dominance on the sample data. Then, based on class dominance, in subsection 3.2
we state a principal angle condition on the data distribution that is sufficient for the classification
consistency of SRC. In particular, our theorems largely explain the success of SRC, are still valid
when $\ell_1$ minimization is replaced by OMP or marginal regression, and can help identify data models
for SRC to work well without requiring the subspace assumption. Our results make SRC more
appealing in terms of theoretical foundation, computational complexity and general applicability,
and are illustrated via various real data sets including face images, text features, and network data
in Section 4. We conclude the paper in Section 5, with all proofs relegated to a separate section.


2. Sparse Representation Review 

2.1. The SRC Algorithm 

We first summarize the SRC algorithm using $\ell_1$ minimization in Algorithm 1, which consists of
the subset regression step and the classification step.


Algorithm 1 Sparse representation classification by $\ell_1$ minimization

Input: An $m \times n$ matrix $X$, where each column $x_i$ represents a training observation with a
known label $y_i \in [1, \ldots, K]$. An $m \times 1$ testing observation $x$ with its true label $y$ being unknown.
Unless mentioned otherwise, we always assume that $x$ and each column of $X$ are pre-scaled to unit
norm, and that $X$ is not orthogonal to $x$ (otherwise $\beta$ is always the zero vector).

1. Find a sparse representation of $x$ by $\ell_1$ minimization:

Solve: $$\beta = \arg\min \|\beta\|_1 \ \text{subject to} \ \|x - X\beta\|_2 \le \epsilon. \tag{1}$$

2. Classify $x$ by the sparse representation $\beta$:

$$g(x) = \arg\min_{k=1,\ldots,K} \|x - X\beta_k\|_2, \tag{2}$$

break ties deterministically. For each entry of $\beta_k$, $\beta_k(i) = \beta(i)$ if $y_i = k$, and $0$ otherwise.

Output: The assigned label $g(x)$.
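
As a minimal sketch of Step 2 of Algorithm 1, the following Python (NumPy) function applies the class-wise residual rule to a regression vector that is assumed to have been computed already by any subset regression method. The function name `src_classify` and the toy data are hypothetical and serve only as illustration; Step 1 is not implemented here.

```python
import numpy as np

def src_classify(X, labels, x, beta):
    """Step 2 of SRC: return the class whose coefficients leave the smallest residual."""
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted, so argmin breaks ties by the smallest label
    residuals = []
    for k in classes:
        # beta_k keeps the entries of beta associated with class k and zeros the rest
        beta_k = np.where(labels == k, beta, 0.0)
        residuals.append(np.linalg.norm(x - X @ beta_k))
    return classes[int(np.argmin(residuals))]

# Toy data (hypothetical): three unit-norm training columns, two classes.
X = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.8]])
labels = [1, 2, 2]
x = np.array([0.6, 0.8])            # testing observation
beta = np.array([0.0, 0.0, 1.0])    # a sparse representation with x = X @ beta
predicted = src_classify(X, labels, x, beta)
```

Here the class-2 residual is zero while the class-1 residual equals the norm of x, so the rule assigns label 2.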


Solving Equation 1 by $\ell_1$ minimization is the only computationally costly part of SRC. There are
many possible methods to solve $\ell_1$ minimization, see [30], among which we use the
$\ell_1$ homotopy method for subsequent analysis and numerical experiments. This method is based on
a polygonal solution path [31], [32] and can also be used for Lasso and least angle regression [7].

Alternatively, OMP is a greedy approximation of $\ell_1$ minimization and is equivalent to forward
stepwise regression; it has gained popularity in sparse recovery due to its reduced running time
and certain theoretical guarantees [26], [33], [34], [35]. Furthermore, OMP is quite similar to $\ell_1$
homotopy in the implementation, and there exist many extensions of OMP.

As for marginal regression, it is probably the simplest and fastest way to do subset regression,
and it has been studied and applied successfully in many areas. Despite its simplicity, it has been
shown to work well for variable selection in high-dimensional data compared with Lasso [39],
is particularly popular in ultra-high-dimensional screening, and has been applied
to sparse representation as well [43]. We can always use OMP or marginal regression to find the
sparse representation $\beta$ in step 1, rather than solving Equation 1 by $\ell_1$ minimization. In the next
subsection we compare $\ell_1$ homotopy, OMP, and marginal regression in more detail.

Note that the constraint in Equation 1 can be replaced by $x = X\beta$ in a noiseless setting,
but usually $\epsilon$ is required in order to achieve a more parsimonious model when dealing with
high-dimensional or noisy data. This model selection problem, i.e., the choice of $\epsilon$ or more generally the
sparsity level of subset regression, is a difficult problem intrinsic to most subset regression methods.
We will explain this issue from the algorithmic point of view in the next subsection.

2.2. Homotopy, OMP, and Marginal Regression 

As $\ell_1$ homotopy can be treated as an extension of OMP, and marginal regression is very simple,
we only list the OMP algorithm in detail in Algorithm 2.

Algorithm 2 Use orthogonal matching pursuit to solve Step 1 of SRC

Input: The training data $X$, the testing observation $x$, and a specified iteration limit $s$ and/or
a residual limit $\epsilon$.

Initialization: The residual $r_0 = x$, iteration count $t = 1$, and the selected data $X_0 = \emptyset$.

1. Find the index $i_t$ such that $i_t = \arg\max_{i=1,\ldots,n} |x_i' r_{t-1}|$, where $x_i$ is the $i$th column of $X$ and
$'$ is the transpose sign. Break ties deterministically, and add $x_{i_t}$ into the selected data so that
$X_t = [X_{t-1} \ x_{i_t}]$.

2. Update the regression vector $\beta$ with respect to $X_t$, i.e., calculate the orthogonal projection
matrix $P_t = X_t X_t^{-}$ with $X_t^{-}$ being the pseudo-inverse, and let $\beta = X_t^{-} x$. Then update the
regression residual as $r_t = (I - P_t)x$.

3. If $t = s$, or $\|r_t\|_2 < \epsilon$, or $|X' r_t| < \epsilon 1_{n \times 1}$ entry-wise, stop and let $s = t$; else increment $t$ and return to step 1.

Output: $X_s$ and $\beta$. Note that the sparse representation $\beta$ can be enlarged from an $s \times 1$ vector
to an $n \times 1$ vector based on the relative positions of $X_s$ in $X$.
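
The steps of Algorithm 2 can be sketched in Python (NumPy) as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function name `omp`, the default tolerance, and the toy data are assumptions, and the regression vector is computed through the pseudo-inverse at every iteration for clarity instead of an incremental update.

```python
import numpy as np

def omp(X, x, s, eps=1e-10):
    """Orthogonal matching pursuit; returns the n x 1 sparse vector and the selected indices."""
    n = X.shape[1]
    selected = []                      # indices i_1, ..., i_t of the selected columns
    coef = np.zeros(0)
    r = np.array(x, dtype=float)       # residual r_0 = x
    for _ in range(s):
        corr = np.abs(X.T @ r)         # |x_i' r_{t-1}| for every column
        if selected:
            corr[selected] = -np.inf   # never reselect a column
        selected.append(int(np.argmax(corr)))   # argmax takes the first maximum (deterministic ties)
        X_t = X[:, selected]
        coef = np.linalg.pinv(X_t) @ x          # regression vector on the selected columns
        r = x - X_t @ coef                      # residual (I - P_t) x
        # stopping rules: small residual, or residual almost orthogonal to the training data
        if np.linalg.norm(r) < eps or np.all(np.abs(X.T @ r) < eps):
            break
    beta = np.zeros(n)
    beta[selected] = coef              # enlarge the s x 1 vector to n x 1
    return beta, selected

# Toy data (hypothetical): orthonormal columns and a unit-norm testing observation.
X = np.eye(3)
x = np.array([0.0, 0.6, 0.8])
beta, selected = omp(X, x, s=2)
```

In this toy case the column most correlated with x is the third one, followed by the second, after which the residual vanishes and the loop stops.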


The idea of OMP is the same as forward selection: at each iteration OMP finds the column 
that is most correlated with the residual, and then re-calculates the regression vector by projecting 
x onto the selected sub-matrix Xt. When the iteration limit is reached, or the residual is small 
enough, or the residual is almost orthogonal to the training data, OMP stops.

The homotopy method is the same as OMP in terms of the data selection, but it has an
extra data deletion step and a different updating scheme. Conceptually, the homotopy path seeks
$\beta = \arg\min_{\beta} \|x - X\beta\|_2^2/2 + \lambda\|\beta\|_1$ iteratively by reducing $\lambda$ from a positive number to 0, which is
proved to solve the $\ell_1$ minimization problem and can also be used for the Lasso regression. More
details can be found in the homotopy references cited in Section 2.1. Our experiments use the
homotopy algorithm implemented by S. Asif and J. Romberg.¹

The marginal regression method does not involve any iteration; it simply chooses the $s$ columns
of $X$ that are most correlated with the testing observation $x$, and calculates $\beta$ to be the regression
vector with respect to the selected $X_s$. Because marginal regression is a non-iterative method, it
enjoys a superior running time complexity compared with the others: for the data selection step, it takes
only $O((m + \min(s, \log n))n)$ while OMP needs $O(mns)$; and for small $s$ marginal regression is
much faster than full regression (i.e., the usual $\ell_2$ minimization using full training data).
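
For comparison, marginal regression can be sketched in a few lines of Python (NumPy). The function name `marginal_regression` and the toy data are hypothetical. A full sort is used for simplicity, so this sketch does not attain the selection complexity quoted above, and `np.linalg.lstsq` is used so that a rank-deficient selection still yields the minimum-norm regression vector.

```python
import numpy as np

def marginal_regression(X, x, s):
    """Select the s columns most correlated with x, then regress x on them."""
    corr = np.abs(X.T @ x)                          # marginal correlations with x
    order = np.argsort(-corr, kind="stable")[:s]    # indices of the s largest correlations
    coef, *_ = np.linalg.lstsq(X[:, order], x, rcond=None)
    beta = np.zeros(X.shape[1])
    beta[order] = coef                              # enlarge to an n x 1 vector
    return beta, order.tolist()

# Toy data (hypothetical): same setting as a simple orthonormal design.
X = np.eye(3)
x = np.array([0.0, 0.6, 0.8])
beta, selected = marginal_regression(X, x, s=1)
```

At s = 1 the selected column is simply the one with the largest absolute correlation with x, which is also the first column that OMP would select.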

Clearly the three subset regression methods may yield different $X_s$ and thus different $\beta$, but they
always coincide at $s = 1$, which is an important fact for the later proof. Another useful observation
is that $X_s$ is always full rank when using $\ell_1$ homotopy or OMP (otherwise they stop), but this is
not necessarily the case when using marginal regression beyond a certain $s$. In the main section we will
show that under a principal angle condition on the data model, all three methods can have the
same asymptotic inferential effect, even though their sparse representations $\beta$ may be different.

Note that the model selection problem is inherent in the stopping criteria, and the stopping
criteria used in Algorithm 2 are commonly applied in subset regression. For example, [33] only
specifies the iteration limit $s$ to stop OMP, which is suitable when the testing observation is perfectly
recoverable; [1] stops $\ell_1$ minimization for small residual $\|r_t\|_2 < \epsilon$, which is more practical for real data,
but a good choice of $\epsilon$ may be data-dependent; the almost orthogonal criterion (i.e., $|X' r_t| < \epsilon 1_{n \times 1}$)
has been used in [34], [35] for OMP to work well for sparse recovery; and other stopping criteria are
also possible, such as Mallows's $C_p$. As model selection does not affect the main theoretical results,
we do not delve into this topic; but its finite-sample inference effect for real data is often difficult
to quantify, so in the numerical experiments we always plot the SRC error with respect to various
sparsity levels while setting $\epsilon$ to be effectively zero, in order to give a fair evaluation of SRC for all
possible models up to a certain limit.


¹ http://users.ece.gatech.edu/~sasif/homotopy/

3. Main Results


Let us introduce some notation before proceeding: $X$ denotes the training data matrix of size
$m \times n$, $X_s$ denotes the selected sub-matrix of size $m \times s$ by subset regression, $X_k$ denotes the
sub-matrix of $X_s$ whose columns are associated with class $k$, and $X_{-k}$ denotes the sub-matrix of $X_s$ whose
columns are not of class $k$. Furthermore, $\beta$ represents the regression vector or sparse representation
with respect to $X_s$ or $X$, which may be an $s \times 1$ vector or $n \times 1$ vector depending on the context, i.e.,
we use $X_s\beta$ and $X\beta$ interchangeably, where the former uses the $s \times 1$ regression vector and the latter
uses the $n \times 1$ sparse representation; they only differ in zero entries. $\beta_k$ equals $\beta$ except every entry
not associated with class $k$ is 0, and $\beta_{-k} = \beta - \beta_k$; similar to $\beta$, their size may be different
depending on the context by shrinking or expanding the zero entries.

3.1. Class Dominance in the Regression Vector 

We first define class dominance and positive class dominance for given regression vector and 
given sample data. They are not only important catalysts between the principal angle condition 
and the theoretical SRC optimality, but also crucial components underlying the empirical success 
of SRC as shown in the numerical section. 

Definition. Given $\beta$, the testing observation $x$ and the training data $X$, we say class $y$ dominates
$\beta$ if and only if $\|X\beta_{-y}\|_2 < \|X\beta_y\|_2$.

We say that class $y$ positively dominates the regression vector $\beta$ if and only if $\|X\beta_y\|_2 \le \|X\beta_{-k}\|_2$
for all $k \neq y$.

Note that we say (positive) class dominance holds if and only if the correct class (positively) 
dominates the sparse representation of the testing observation. 
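
The two definitions can be checked numerically on given sample data. The sketch below uses Python (NumPy); the function names and the toy data are hypothetical, and the positive dominance inequality is read as non-strict, which is an assumption made here so that the two-class case (where the two sides coincide) is covered.

```python
import numpy as np

def class_norms(X, labels, beta, k):
    """Return (||X beta_k||_2, ||X beta_{-k}||_2) for class k."""
    in_k = np.asarray(labels) == k
    own = np.linalg.norm(X[:, in_k] @ beta[in_k])
    rest = np.linalg.norm(X[:, ~in_k] @ beta[~in_k])
    return own, rest

def dominates(X, labels, beta, y):
    """Class dominance: ||X beta_{-y}||_2 < ||X beta_y||_2."""
    own, rest = class_norms(X, labels, beta, y)
    return bool(rest < own)

def positively_dominates(X, labels, beta, y):
    """Positive class dominance (non-strict reading, an assumption):
    ||X beta_y||_2 <= ||X beta_{-k}||_2 for every class k other than y."""
    own, _ = class_norms(X, labels, beta, y)
    for k in np.unique(labels):
        if k != y:
            _, rest_k = class_norms(X, labels, beta, k)
            if own > rest_k:
                return False
    return True

# Toy data (hypothetical): two classes, true class y = 2.
X = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.8]])
labels = [1, 2, 2]
beta = np.array([0.1, 0.0, 0.9])
dom = dominates(X, labels, beta, 2)
pos = positively_dominates(X, labels, beta, 2)
```

In this two-class toy example class 2 carries norm 0.9 against 0.1 for class 1, so class dominance holds, and positive class dominance holds with equality.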

For any given $\beta$, class dominance and positive class dominance together are sufficient for correct
classification of SRC, formulated as follows.

Theorem 1. Given $\beta$, $(x, y)$ and $X$, class $y$ dominance implies $g(x) = y$ for SRC if class $y$ also
positively dominates $\beta$.

If positive class dominance does not hold, class dominance itself is not sufficient for g(x) = y. 
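
To see why the two notions combine, the following worked chain treats only the special case of an exact representation $x = X\beta$. This exactness is an assumption added here for illustration and is not part of the theorem; the display is a sketch and not the proof given in the paper.

```latex
% Sketch under the added assumption x = X\beta (exact representation).
\begin{align*}
x - X\beta_k &= X\beta - X\beta_k = X\beta_{-k}
  && \text{so that } g(x) = \arg\min_{k=1,\ldots,K} \|X\beta_{-k}\|_2, \\
\|x - X\beta_y\|_2 &= \|X\beta_{-y}\|_2 \\
  &< \|X\beta_y\|_2        && \text{(class dominance)} \\
  &\le \|X\beta_{-k}\|_2   && \text{(positive class dominance, for all } k \neq y\text{)} \\
  &= \|x - X\beta_k\|_2 .
\end{align*}
```

Under this assumption the residual of the true class is strictly smaller than that of every other class, which gives $g(x) = y$.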

Although class dominance cannot guarantee correct classification in SRC, it is closely related 
to positive class dominance and can lead to the latter in many scenarios. The next corollary is an 
example.

Corollary 1. Suppose $K = 2$, or the data are non-negative and the regression vector $\beta$ is
constrained to be non-negative.

Then given $\beta$ and the sample data, class dominance implies positive class dominance, in which
case class dominance alone is sufficient for $g(x) = y$ in SRC.

Despite the limitations of Corollary 1, two-class classification problems are common; real data
are often non-negative; and the non-negativity constraint is very useful in subset regression, such
as the non-negative OMP [44] and the non-negative least squares [45], [46]. In fact, the condition
in Corollary 1 can be further relaxed. For example, if the dominance magnitude is large enough
(i.e., $c\|X\beta_{-y}\|_2 < \|X\beta_y\|_2$ for some $c > 1$) and the negative entries of $\beta$ are properly bounded, then
class dominance still implies positive class dominance and is sufficient for $g(x) = y$ in SRC.

This indicates that class dominance is usually sufficient for correct classification, unless the
negative entries of $\beta$ are too large. Indeed in the numerical section we observe that class dominance
nearly always implies $g(x) = y$, even though the non-negative constraint is not used in subset
regression; and the class dominance error is usually close to the classification error. In the next
subsection, we make use of class dominance to identify a principal angle condition on the data
model, so that SRC can be a consistent classifier without requiring $\ell_1$ minimization and the subspace
assumption.

Note that the concept of class dominance appears similar to block sparsity and block coherence.
But block sparsity and block coherence are used to guarantee that the fewest
blocks/classes of data are used in the sparse representation, which is not directly related
to correct classification; while our class dominance is defined for the correct class of data to dominate
the sparse representation, which can lead to correct classification.

3.2. Classification Consistency of SRC 

In this subsection we formalize the probabilistic setting of classification. Suppose
$(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n) \overset{i.i.d.}{\sim} F_{XY}$, where $(X, Y) \in \mathbb{R}^m \times \{1, \ldots, K\}$ is the random variable
pair generating the testing observation and its class $(x, y)$, and $(X_i, Y_i)$ are the random variables
generating the training pair $(x_i, y_i)$ for $i = 1, \ldots, n$. Note that the prior probability of the data being
in each class $k$ should be nonzero.

The SRC error is denoted as $L = \mathrm{Prob}(g(X) \neq Y \mid (X_1, Y_1), \ldots, (X_n, Y_n))$ for the SRC classifier
$g : \mathbb{R}^m \to \{1, \ldots, K\}$. We always have $L \ge L^*$, where $L^*$ is the optimal Bayes error. For SRC to
achieve consistent classification, it suffices to identify a sufficient condition on $F_{XY}$ so that
$L \to L^*$ as $n \to \infty$. We henceforth consider the case that $L^* = 0$ so that $L \to 0$ implies SRC is
asymptotically optimal.

Based on this probabilistic setting and the previous subsection on class dominance, the SRC
error can be decomposed by conditioning on class dominance:

$$
\begin{aligned}
L &= \mathrm{Prob}(\text{class dominance}) \times \mathrm{Prob}(g(X) \neq Y \mid \text{class dominance}) \\
  &\quad + \mathrm{Prob}(\text{class dominance fails}) \times \mathrm{Prob}(g(X) \neq Y \mid \text{class dominance fails}) \\
  &= P_D \times P_1 + (1 - P_D) \times P_2,
\end{aligned} \tag{3}
$$

where $P_D$ denotes the class dominance probability, $P_1$ denotes the conditional classification error
given class dominance, and $P_2$ denotes the conditional classification error when class dominance fails.
Clearly the class dominance probability $P_D$ depends on both $F_{XY}$ and the subset regression method;
moreover, Corollary 1 indicates that $P_1 = 0$ when $X$ is non-negative and $\beta$ is derived under the
non-negative constraint, which approximately holds throughout our numerical experiments without
the non-negative constraint.
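
On a finite test set, the four quantities in Equation 3 can be estimated from two indicator arrays, and the decomposition then holds exactly for the empirical frequencies. The following Python (NumPy) sketch uses hypothetical names and made-up indicator values purely for illustration.

```python
import numpy as np

def decompose_errors(misclassified, dominance):
    """Empirical estimates of L, P_D, P_1 and P_2 from per-observation indicators."""
    mis = np.asarray(misclassified, dtype=bool)   # True where g(x) != y
    dom = np.asarray(dominance, dtype=bool)       # True where class dominance holds
    L = mis.mean()
    P_D = dom.mean()
    P_1 = mis[dom].mean() if dom.any() else 0.0       # error given class dominance
    P_2 = mis[~dom].mean() if (~dom).any() else 0.0   # error when class dominance fails
    return L, P_D, P_1, P_2

# Made-up indicators for ten testing observations (illustration only).
dominance     = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
misclassified = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
L, P_D, P_1, P_2 = decompose_errors(misclassified, dominance)
reconstructed = P_D * P_1 + (1 - P_D) * P_2   # right-hand side of Equation 3
```

With these values the single error falls among the two observations where class dominance fails, so the conditional error given class dominance is zero and the overall error equals (1 - P_D) times P_2.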

So for SRC to perform well, it suffices to find a condition on $F_{XY}$ so that $P_D$ is close to 1; when $P_1$ is also
negligible, the SRC error $L$ is then close to 0. For SRC to be optimal beyond $\ell_1$ minimization and the
subspace assumption, the condition should be as simple and as general as possible, not requiring the
subspace assumption, yet still achieving class dominance almost surely for most subset regression
methods.

First we state an auxiliary condition to ensure class dominance for given $X_s$ of full rank, which
serves as a starting point for the later results.

Theorem 2. Given $\beta$, $(x, y)$ and any selected data matrix $X_s$ of full rank, class dominance holds
if and only if

$$\theta(x, X_y\beta_y) < \theta(x, X_{-y}\beta_{-y}), \tag{4}$$

where $\theta(x, \cdot)$ denotes the principal angle between $x$ and $\cdot$.
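
The principal angle in Theorem 2 can be computed by projecting x onto the span of the second argument. The Python (NumPy) sketch below is illustrative only: the helper name `principal_angle` and the toy data are hypothetical, and the example takes x to be exactly represented by the selected columns.

```python
import numpy as np

def principal_angle(x, M):
    """Principal angle in [0, pi/2] between the vector x and the column span of M."""
    M = np.asarray(M, dtype=float)
    if M.ndim == 1:
        M = M[:, None]                               # treat a vector as a one-column matrix
    coef, *_ = np.linalg.lstsq(M, x, rcond=None)
    proj = M @ coef                                  # orthogonal projection of x onto span(M)
    cos = np.linalg.norm(proj) / np.linalg.norm(x)
    return float(np.arccos(np.clip(cos, 0.0, 1.0)))

# Toy data (hypothetical): column 0 is class 1, columns 1 and 2 are class 2 (true class y = 2).
X = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.8]])
beta = np.array([0.1, 0.0, 0.9])
x = X @ beta
a = X[:, 1:] @ beta[1:]     # X_y beta_y
b = X[:, :1] @ beta[:1]     # X_{-y} beta_{-y}
angle_own = principal_angle(x, a)
angle_rest = principal_angle(x, b)
```

In this example the angle to the class-2 component is the smaller one, in agreement with the norm comparison that defines class dominance.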

Therefore, when Equation 4 holds for the selected sub-matrix $X_s$, class $y$ dominates the sparse
representation. We can convert this condition into the probabilistic setting as follows.

Theorem 3. Under the probabilistic setting, we define the principal angle condition as follows: for
fixed $(x, y)$, there exists a constant $c_{xy} \in [0, \pi/2)$ such that $\theta(x, X_i) < c_{xy} \mid Y_i = y$ almost surely and
$\theta(x, [X_1, \ldots, X_s]) > c_{xy} \mid Y_i \neq y, i = 1, \ldots, s$ almost surely.

Denote $q$ as the probability that the principal angle condition holds for $(X, Y) \sim F_{XY}$. Then the
class dominance probability $P_D$ is asymptotically no less than $q$, for $\beta$ derived by $\ell_1$ minimization
at any given $s \ge 1$.

Namely, class dominance holds if the within-class data are close while the between-class data
are far away in terms of the principal angle. By Equation 3 and Corollary 1, it is clear that the
principal angle condition can lead to SRC optimality, which we state as a corollary.

Corollary 2. Suppose both the principal angle condition in Theorem 3 and the condition in
Corollary 1 hold with probability $q$ for $(X, Y) \sim F_{XY}$. Then the SRC error using $\ell_1$ minimization satisfies
$L \le 1 - q$ asymptotically.

Thus if $q \to 1$ (i.e., all possible $(x, y)$ in the support of $F_{XY}$ satisfy the principal angle condition),
SRC is asymptotically optimal with $L \to 0$.

This condition does not explicitly rely on the subspace assumption, yet still leads to optimal
classification and can be used to validate SRC applicability on general data models. At $s = 1$,
the principal angle condition can be easily validated by the nearest neighbor based on principal
angle/correlation. But for large $s$, the between-class principal angle is more difficult to check: if
the subspace assumption holds, the principal angle between one observation and $s$ observations of
other classes is usually bounded below, so the condition holds as long as the within-class angle is
small; if the subspace assumption does not hold, the principal angle condition at large $s$ may not
hold even for well-separated data.
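
At $s = 1$, a rough empirical proxy for this check is to ask how often the training observation with the smallest principal angle to a testing observation (equivalently, the largest absolute correlation for unit-norm columns) belongs to the correct class. The Python (NumPy) sketch below is only such a proxy on sample data, not a verification of the almost-sure condition; the function name and the toy data are hypothetical.

```python
import numpy as np

def nearest_angle_agreement(X_train, y_train, X_test, y_test):
    """Fraction of testing columns whose smallest-angle training column has the same label.

    Columns are assumed to be pre-scaled to unit norm, so the absolute inner product
    is the cosine of the principal angle between two observations.
    """
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    cosines = np.abs(X_train.T @ X_test)      # n_train x n_test matrix of |cos theta|
    nearest = np.argmax(cosines, axis=0)      # smallest angle = largest cosine
    return float(np.mean(y_train[nearest] == y_test))

# Toy data (hypothetical): unit-norm columns, two classes.
X_train = np.array([[1.0, 0.0, 0.6],
                    [0.0, 1.0, 0.8]])
y_train = [1, 2, 2]
X_test = np.array([[0.8, 0.0],
                   [0.6, 1.0]])
y_test = [2, 2]
rate = nearest_angle_agreement(X_train, y_train, X_test, y_test)
```

A rate close to 1 suggests that the within-class angle is typically smaller than the between-class angle at $s = 1$; it says nothing about larger $s$.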

Therefore it is sometimes useful to prove the principal angle condition together with the
non-negative constraint in Corollary 1. Because $\cos\theta(x, [X_1, \ldots, X_s])$ equals the largest correlation between $x$
and any linear combination of $[X_1, \ldots, X_s]$, we can require the correlation between $x$ and any
non-negative linear combination of $[X_1, \ldots, X_s]$ to be small instead of $\theta(x, [X_1, \ldots, X_s])$ to be large;
then Corollary 2 still holds. One such application is illustrated in [53] for the adjacency matrix.

The proof of Theorem 3 can be adapted to any of $\ell_1$ minimization, OMP, and marginal regression,
which yields the next corollary.

Corollary 3. When $\ell_1$ minimization is replaced by OMP in the SRC framework, Theorem 3 and
Corollary 2 still hold.

Furthermore, if we constrain the sparsity level $s$ such that $X_s$ selected by marginal regression is
full rank (which is always possible up to a certain $s$), or the original data $X$ itself is full rank, then
Theorem 3 and Corollary 2 also hold for SRC using marginal regression or full regression.

Therefore, not only can OMP and marginal regression be used in SRC, so can full regression. 
However, for real data it is quite common that the full training data matrix X is either rank deficient 
or very close to rank deficient (i.e., having singular values very close to 0). 

So far our principal angle condition in Theorem 3 is quite restrictive, especially $\theta(x, X_i) <
c_{xy} \mid Y_i = y$ almost surely, as it requires data of the correct class to be always close. This can be
relaxed as long as some data of the correct class are close enough to the testing observation, at the
cost of treating far away data of the correct class as data of another class.

Corollary 4. Under the probabilistic setting, suppose we extend the principal angle condition as
follows: for fixed $(x, y)$, there exists a constant $c_{xy} \in [0, \pi/2)$ such that

$$X_i \mid (Y_i = y) = X_{+i} I_{\theta(x, X_i) < c_{xy}} + X_{-i} I_{\theta(x, X_i) > c_{xy}}, \tag{5}$$

where $I$ is the indicator function, and $\theta(x, [X_1, \ldots, X_s]) > c_{xy}$ given that either $Y_i \neq y$ or $X_i = X_{-i}$, for $i =
1, \ldots, s$, almost surely.

Then Theorem 3, Corollary 2 and Corollary 3 still hold. Note that the previous principal angle
condition in Theorem 3 is now a special case with $\mathrm{Prob}(\theta(x, X_i) > c_{xy}) = 0$.

Overall, our results in this subsection can be interpreted as demonstrating that for any given
model, if the within-class principal angle can be small while the between-class principal angle is
always large, then the correct class is likely to dominate the sparse representation, and SRC will
succeed in the classification task. The principal angle condition here is similar to the condition in
[6]: their condition is applied on given sample data while we focus more on the distribution; and
their condition explicitly requires the subspace assumption and $\ell_1$ minimization while we do not.

Furthermore, we have addressed the two questions regarding SRC in the introduction: our principal
angle condition can be used to check whether SRC is applicable to a given data model without
the subspace assumption, for which class dominance plays a crucial role for correct classification;
the theorems also indicate that SRC should perform similarly for any of the aforementioned three
subset regression methods. Both points are reflected in the numerical section.

4. Numerical Experiments 

In this section we apply the sparse representation classifier to various simulated and real data
sets using $\ell_1$ homotopy, OMP, and marginal regression, and illustrate how our theoretical derivation
of SRC is closely related to its numerical performance.

All experiments are carried out by hold-out validation, and for each data set we always randomly
split the data in half for training and testing. Then we estimate the SRC error, the class dominance
error, the SRC error given class dominance, and the SRC error when class dominance fails, i.e., the
estimates of $L$, $1 - P_D$, $P_1$ and $P_2$ in Equation 3. To give a fair evaluation and account for possible
early termination by various model selection criteria, the errors are always calculated against the
sparsity level from $s = 1, \ldots, 100$, i.e., we re-calculate the regression vector and re-classify the
testing observation for each $s$.

We also add $k$-nearest-neighbor ($k$NN) and linear discriminant analysis (LDA) for benchmark
purposes of the classification error. They are calculated against the projection dimensions, i.e., we
linearly project the data into dimension $d = 1, \ldots, 100$ by principal component analysis (or spectral
embedding if the input is a dissimilarity/similarity matrix), and apply 9-nearest-neighbor (9 is just
an arbitrary choice) and LDA on the projected data.

In all examples, the above procedure is repeated for 100 Monte Carlo replicates with the mean 
errors presented. 

4.1. Face Images

We first apply SRC to two face image data sets, one of which is also used by [1] to show the
empirical advantage of SRC.

The Extended Yale B database has 2414 face images of 38 individuals under various poses and
lighting conditions. These images are further re-sized to 32 x 32 for our experiment. Half
of the data is used for training and the other half for testing, so m = 1024, n = 1207, and K = 38.
We show the mean errors after 100 Monte Carlo runs in Figure 1.

The CMU PIE database has 11554 images of 68 individuals under various poses, illuminations
and expressions [53]. We also use the size 32 x 32 re-sized images for classification, so m = 1024,
n = 5777, and K = 68. The mean errors are shown in Figure 2.

The top left panel of each figure shows the SRC error, and we observe that the error rates for
different subset regression methods are very similar. The best error achieved for the Extended Yale B
database is 0.0207 by OMP, and the best error for CMU PIE is 0.0239 by $\ell_1$ minimization. Note
that SRC by full regression achieves a mean error of 0.0606 and 0.0442 respectively, which is somewhat
worse than subset regression. As for $k$NN and LDA, their error rates are always greater than 0.1
in both data sets, so they are not shown in order to better compare the SRC errors.

For both data sets, the top right panel shows the class dominance error $1 - P_D$, which is slightly
higher than the SRC error $L$; their difference is less than 0.05. The classification error given class
dominance satisfies $P_1 < 0.01$ for all three subset regression methods and all sparsity levels, which
is not shown in the figures. The bottom panel shows $P_2$, which is much higher than $L$ and $P_1$.

Those additional panels demonstrate that class dominance largely explains the success of SRC 
for face images, and the testing data can only be misclassified due to the failure of class dominance. 
Since all three subset regression methods achieve almost zero errors given class dominance, it is 
also the main reason that all methods have similar classification errors in the top left panel. 

4.2. Wikipedia Data

Next we apply SRC to our Wikipedia documents with text and network features. We collect 1382
English documents from Wikipedia based on the 2-neighborhood of the English article “algebraic
geometry”, then form an adjacency matrix based on the documents’ hyperlinks and a text feature
distance matrix based on latent semantic analysis [54] and cosine distance. The data is available
on our website,² and other examples of applying SRC to vertex classification in graphs can be
found in our previous work.

^1 http://www.cis.jhu.edu/~cshen/ 

Figure 1: SRC for Extended Yale B Database (horizontal axis: sparsity level) 

There are five classes in total for the documents, and both data sets are of size 1382 x 1382 
(because the network data is an adjacency matrix, and the text feature data is a cosine distance 
matrix). Using half of the columns for training and the other half for testing, we have $m = 1382$, 
$n = 691$, and $K = 5$. The numerical performance is shown in Figure 3 and Figure 4 for the 
text and network data respectively. As the input data is a dissimilarity/similarity matrix, we use 
spectral embedding for projection prior to applying kNN and LDA. 
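The column split described above can be sketched as follows. The text does not say how the two halves are chosen, so the seeded random shuffle and the helper name `split_columns` are assumptions made only for illustration.

```python
import random


def split_columns(n_columns, seed=0):
    """Randomly split column indices into equal training and testing halves."""
    indices = list(range(n_columns))
    random.Random(seed).shuffle(indices)
    half = n_columns // 2
    return sorted(indices[:half]), sorted(indices[half:])


train_cols, test_cols = split_columns(1382)
# Each half holds n = 691 columns; every column keeps all m = 1382 rows.
print(len(train_cols), len(test_cols))
```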

The overall interpretation is similar to the face images: SRC performs quite well and is stable 
throughout different sparsity levels and different subset regression methods; the class dominance 
error $1 - P_D$ is higher than the SRC error $L$ (for text data they are quite close, but for network 
data they are less close as the sparsity level increases); $P_1 < 0.007$ in this example, indicating class 
dominance is crucial for correct classification; and $P_2$ is close to the chance line and much higher 
than $L$ and $P_1$. 

Note that the SRC classification errors for text features are lower than their network counterparts, 
presumably because the text features are more informative than the network data; we also observe that 
SRC becomes slightly inferior to LDA at large projection dimension $d$ for text features, which is 
not the case for the adjacency matrix. This is probably because the cosine distance is a particularly 
suitable distance measure for text data [55], thus allowing LDA to do better at proper projection 
dimensions. This phenomenon also holds for the face images in the previous subsection: even though 
LDA performs much worse than SRC on the raw data, LDA can achieve error rates similar to SRC 
for appropriate transformations of those images [58]. 

Figure 2: SRC for CMU PIE Database 


5. Conclusion 

In this paper we investigate the sparse representation classifier, which has recently become very 
popular due to its empirical success on real data. In order to better understand the theory behind 
this method, we focus on its regression and classification steps, and develop the 
notions of class dominance and the principal angle condition. Our derivation establishes a theoretical 
foundation of sparse representation from a point of view different from the current literature, and 
implies that $\ell_1$ minimization and the subspace assumption may not be crucial for SRC, which 
allows faster subset regression methods and easier data model verification for this method. Our 
results are illustrated by analyses of various real data sets, including face images that roughly satisfy the 
subspace assumption, as well as text and network data that do not satisfy this assumption. 

Figure 3: SRC for Wikipedia English Documents Text Feature 

6. Proofs 

6.1. Theorem and Corollary 

Proof. Assume that class $y$ dominates $\beta$, so that $\|\mathcal{X}\beta_{-y}\|_2 < \|\mathcal{X}\beta_y\|_2$; furthermore, if positive class 
dominance holds, we have $\|\mathcal{X}\beta_{-y}\|_2 < \|\mathcal{X}\beta_y\|_2 < \|\mathcal{X}\beta_{-k}\|_2$ for all $k \neq y$. 

Note that we can always express the testing observation as 

$$x = \mathcal{X}\beta + \epsilon = \mathcal{X}\beta_k + \mathcal{X}\beta_{-k} + \epsilon,$$

where $\epsilon$ is the regression residual orthogonal to both $\mathcal{X}\beta_k$ and $\mathcal{X}\beta_{-k}$ for each $k$, and $\epsilon$ is always 
fixed throughout $k = 1, \ldots, K$ for given $\beta$, $\mathcal{X}$ and $x$. 

Figure 4: SRC for Wikipedia English Documents Network Feature 

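Thus given class dominance and positive class dominance, we have $\|\mathcal{X}\beta_{-y}\|_2 = \|x - \mathcal{X}\beta_y - \epsilon\|_2 < 
\|x - \mathcal{X}\beta_k - \epsilon\|_2 = \|\mathcal{X}\beta_{-k}\|_2$ for all $k \neq y$. Because $\epsilon$ is orthogonal to $x - \mathcal{X}\beta_k - \epsilon$, the Pythagorean 
theorem gives $\|x - \mathcal{X}\beta_k\|_2^2 = \|x - \mathcal{X}\beta_k - \epsilon\|_2^2 + \|\epsilon\|_2^2$ for every $k$, and we immediately have 
$\|x - \mathcal{X}\beta_y\|_2 < \|x - \mathcal{X}\beta_k\|_2$ for all $k \neq y$. 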

Therefore, $y = \arg\min_{k=1,\ldots,K} \|x - \mathcal{X}\beta_k\|_2$, and $g(x) = y$ for the SRC classifier. 
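The argument above can be checked numerically on a small example. The sketch below assumes the regression vector is already given (no particular regression method is run); the helper names `class_part`, `src_classify` and `positive_class_dominance`, and the toy data, are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(t * t for t in v))


def class_part(columns, labels, beta, keep):
    """Return sum_i beta[i] * columns[i] over the columns i with keep(labels[i])."""
    out = [0.0] * len(columns[0])
    for col, lab, b in zip(columns, labels, beta):
        if keep(lab):
            out = [o + b * c for o, c in zip(out, col)]
    return out


def src_classify(columns, labels, beta, x):
    """SRC rule: return the class k minimising ||x - X beta_k||_2."""
    residual = {}
    for k in sorted(set(labels)):
        fit = class_part(columns, labels, beta, lambda lab: lab == k)
        residual[k] = norm([xi - fi for xi, fi in zip(x, fit)])
    return min(residual, key=residual.get)


def positive_class_dominance(columns, labels, beta, y):
    """Check ||X beta_{-y}|| < ||X beta_y|| < ||X beta_{-k}|| for all k != y."""
    own = norm(class_part(columns, labels, beta, lambda lab: lab == y))
    rest = norm(class_part(columns, labels, beta, lambda lab: lab != y))
    others = [norm(class_part(columns, labels, beta, lambda lab, k=k: lab != k))
              for k in set(labels) if k != y]
    return rest < own and all(own < o for o in others)


# Toy data: one unit-norm column per class; the residual lies along the
# fourth coordinate, so it is orthogonal to every selected column.
columns = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
labels = [1, 2, 3]
beta = [0.8, 0.3, 0.2]
x = [0.8, 0.3, 0.2, 0.1]
print(positive_class_dominance(columns, labels, beta, 1))
print(src_classify(columns, labels, beta, x))
```

Here class 1 satisfies positive class dominance and is also the class with the smallest residual, as the proof predicts.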

Clearly if positive class dominance does not hold, there exist counterexamples in which SRC fails to 
find the correct class. However, if there are only two classes (i.e., $K = 2$) or $\mathcal{X}$ and $\beta_k$ are always 
non-negative (i.e., all observations are non-negative and the regression coefficients are constrained 
to be non-negative), then class $y$ dominance guarantees that $\|\mathcal{X}\beta_y\|_2$ cannot be larger than $\|\mathcal{X}\beta_{-k}\|_2$ 
for all $k \neq y$. This is because 

$$\begin{aligned}
\|\mathcal{X}\beta_{-k}\|_2 &= \|x - \mathcal{X}\beta_k - \epsilon\|_2 \\
&= \|\mathcal{X}\beta_1 + \cdots + \mathcal{X}\beta_{k-1} + \mathcal{X}\beta_{k+1} + \cdots + \mathcal{X}\beta_K\|_2 \\
&\geq \|\mathcal{X}\beta_y\|_2,
\end{aligned}$$

where the last inequality easily follows when $K = 2$ or $\mathcal{X}$ and $\beta_k$ are always non-negative. Therefore, 
in this case class dominance implies positive class dominance, and is sufficient for correct 
classification of SRC. □ 
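The non-negative case rests on a simple fact: the norm of a sum of entrywise non-negative vectors is at least the norm of each summand, since all cross inner products are non-negative. A small sketch with made-up vectors (the helper name `sum_dominates_each_part` is hypothetical) illustrates this, and also shows how the inequality can fail once signs are mixed.

```python
import math


def norm(v):
    return math.sqrt(sum(t * t for t in v))


def sum_dominates_each_part(parts):
    """Check ||p_1 + ... + p_J||_2 >= ||p_j||_2 for every summand p_j."""
    total = [sum(coords) for coords in zip(*parts)]
    return all(norm(total) >= norm(p) for p in parts)


# Entrywise non-negative summands: the inequality holds.
nonneg = [[0.5, 0.0, 0.2], [0.1, 0.3, 0.0], [0.0, 0.4, 0.4]]
print(sum_dominates_each_part(nonneg))

# With mixed signs the summands can cancel, so the inequality may fail.
mixed = [[1.0, 0.0], [-0.9, 0.1]]
print(sum_dominates_each_part(mixed))
```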

6.2. Theorem 

Proof. We first decompose the testing observation as $x = \mathcal{X}_y\beta_y + \mathcal{X}_{-y}\beta_{-y} + \epsilon$, which is essentially 
the same as in the previous proof with a different notation for easier presentation. Note that the 
regression residual $\epsilon$ is orthogonal to each column of $\mathcal{X}_s$. 

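Next we consider the principal angle $\theta(x, \mathcal{X}_y\beta_y)$. By assuming all involved entities are positive, 
we have 

$$\begin{aligned}
\cos\theta(x, \mathcal{X}_y\beta_y) &= |x'\mathcal{X}_y\beta_y| / (\|x\|_2 \|\mathcal{X}_y\beta_y\|_2) \\
&= |\|\mathcal{X}_y\beta_y\|_2^2 + (\mathcal{X}_{-y}\beta_{-y})'\mathcal{X}_y\beta_y| / \|\mathcal{X}_y\beta_y\|_2 \\
&= \|\mathcal{X}_y\beta_y\|_2 + (\mathcal{X}_{-y}\beta_{-y})'\mathcal{X}_y\beta_y / \|\mathcal{X}_y\beta_y\|_2 \\
&= \|\mathcal{X}_y\beta_y\|_2 + \|\mathcal{X}_{-y}\beta_{-y}\|_2 \cdot \cos\theta(\mathcal{X}_y\beta_y, \mathcal{X}_{-y}\beta_{-y}),
\end{aligned}$$

where the first equality holds because $\mathcal{X}_y\beta_y$ is a vector, the second equality follows by decomposing 
$x$ (the residual $\epsilon$ is orthogonal to $\mathcal{X}_y\beta_y$, and the factor $\|x\|_2$ is dropped, which corresponds to $x$ 
being scaled to unit norm), and the third and fourth equalities hold when there are no negative terms involved. 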

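Similarly, we have $\cos\theta(x, \mathcal{X}_{-y}\beta_{-y}) = \|\mathcal{X}_{-y}\beta_{-y}\|_2 + \|\mathcal{X}_y\beta_y\|_2 \cos\theta(\mathcal{X}_y\beta_y, \mathcal{X}_{-y}\beta_{-y})$. 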

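Because $\cos\theta(\mathcal{X}_y\beta_y, \mathcal{X}_{-y}\beta_{-y})$ is always smaller than 1 (if it is 1, $\mathcal{X}_y\beta_y$ is a vector in the 
same direction as $\mathcal{X}_{-y}\beta_{-y}$, in which case $\mathcal{X}_s$ cannot be full rank), it is trivial to observe that 
$\cos\theta(x, \mathcal{X}_y\beta_y) > \cos\theta(x, \mathcal{X}_{-y}\beta_{-y})$ if and only if $\|\mathcal{X}_y\beta_y\|_2 > \|\mathcal{X}_{-y}\beta_{-y}\|_2$. 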
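The equivalence can be made explicit by subtracting the two expressions. The shorthand $a = \|\mathcal{X}_y\beta_y\|_2$, $b = \|\mathcal{X}_{-y}\beta_{-y}\|_2$ and $c = \cos\theta(\mathcal{X}_y\beta_y, \mathcal{X}_{-y}\beta_{-y})$ is introduced here only for this display and is not part of the original notation.

```latex
\begin{align*}
\cos\theta(x, \mathcal{X}_y\beta_y) - \cos\theta(x, \mathcal{X}_{-y}\beta_{-y})
  &= (a + b\,c) - (b + a\,c) \\
  &= (a - b)(1 - c).
\end{align*}
```

Since $c < 1$, the factor $1 - c$ is strictly positive, so the difference has the same sign as $a - b$.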

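When the involved entities are not always positive, the only other possible scenario is that one 
absolute term negates the positive sign, e.g., $\cos\theta(x, \mathcal{X}_{-y}\beta_{-y}) = -\|\mathcal{X}_{-y}\beta_{-y}\|_2 - (\mathcal{X}_{-y}\beta_{-y})'\mathcal{X}_y\beta_y / \|\mathcal{X}_{-y}\beta_{-y}\|_2$. 
This can only happen when $\|\mathcal{X}_y\beta_y\|_2 > \|\mathcal{X}_{-y}\beta_{-y}\|_2$, in which case we also have $\cos\theta(x, \mathcal{X}_y\beta_y) > 
\cos\theta(x, \mathcal{X}_{-y}\beta_{-y})$. 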

Therefore, class $y$ dominates the regression vector $\beta$ if and only if $\theta(x, \mathcal{X}_y\beta_y) < \theta(x, \mathcal{X}_{-y}\beta_{-y})$, 
assuming $\mathcal{X}_s$ is full rank. □ 
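The equivalence between class dominance and the ordering of the principal angles can be illustrated numerically. In the sketch below the two fitted parts and the residual are supplied directly rather than produced by a regression, and the function name `dominance_and_angle` and all vectors are made up for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cos_angle(u, v):
    """Cosine of the principal angle between two vectors: |u'v| / (||u|| ||v||)."""
    return abs(dot(u, v)) / math.sqrt(dot(u, u) * dot(v, v))


def dominance_and_angle(own, rest, resid):
    """Return (class dominance holds, own part has the smaller angle to x).

    own, rest: the two fitted parts of x; resid: a residual orthogonal to both.
    """
    assert abs(dot(own, resid)) < 1e-12 and abs(dot(rest, resid)) < 1e-12
    x = [a + b + e for a, b, e in zip(own, rest, resid)]
    dominates = dot(own, own) > dot(rest, rest)
    smaller_angle = cos_angle(x, own) > cos_angle(x, rest)
    return dominates, smaller_angle


resid = [0.0, 0.0, 0.1]
print(dominance_and_angle([0.8, 0.1, 0.0], [0.1, 0.3, 0.0], resid))
print(dominance_and_angle([0.2, 0.1, 0.0], [0.1, 0.6, 0.0], resid))
print(dominance_and_angle([0.7, 0.0, 0.0], [-0.2, 0.3, 0.0], resid))
```

In each of the three cases the two returned flags agree, including the last one, where the second part has a negative entry.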

6.3. Theorem 

Proof. It suffices to prove that when $(x, y)$ satisfies the principal angle condition, the class 
dominance probability satisfies $P_D \to 1$. 

We proceed by first assuming that $\mathcal{X}_y$ is non-empty when using $\ell_1$ homotopy. Note that $\mathcal{X}_s$ is 
always full rank when it is selected by $\ell_1$ homotopy. 

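As $\theta(x, X_1) \leq c_{xy} \mid Y_1 = y$ almost surely for some $c_{xy} \in [0, \pi/2)$, we always have $\theta(x, \mathcal{X}_y\beta_y) \leq c_{xy}$. 
And as $\theta(x, [X_1, \ldots, X_s]) > c_{xy} \mid Y_i \neq y$ almost surely, we have $\theta(x, \mathcal{X}_{-y}\beta_{-y}) > c_{xy}$. 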

Therefore, with probability 1 we have $\theta(x, \mathcal{X}_y\beta_y) < \theta(x, \mathcal{X}_{-y}\beta_{-y})$, as long as $\mathcal{X}_y$ is non-empty. 

So it remains only to justify that $\mathcal{X}_y$ is non-empty asymptotically. 

We claim that under the principal angle condition, $\mathcal{X}_y$ is asymptotically non-empty when using 
$\ell_1$ homotopy. First, as the prior probability of class $y$ cannot be zero, the training data contains 
data of class $y$ with probability converging to 1 as $n \to \infty$. Next, conditioning on the event that $X$ 
contains some data of class $y$, the first selected datum by $\ell_1$ homotopy must be of class $y$ (which 
is most correlated with the testing observation under the principal angle condition). But the first 
entered element may get deleted in the homotopy solution path, and it seems possible that $\mathcal{X}_y$ is 
empty at some $s$. 
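For unit-norm columns, the datum most correlated with the testing observation is also the one with the smallest principal angle to it, which is why the principal angle condition pins down the class of the first selected datum. The sketch below illustrates this on made-up data; the helper names `principal_angle` and `first_selected`, the labels, and the vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def principal_angle(u, v):
    """Angle in [0, pi/2] between the lines spanned by u and v."""
    c = abs(dot(u, v)) / math.sqrt(dot(u, u) * dot(v, v))
    return math.acos(min(1.0, c))


def first_selected(columns, x):
    """Index of the unit-norm column most correlated with x."""
    return max(range(len(columns)), key=lambda i: abs(dot(columns[i], x)))


# Hypothetical unit-norm training columns with labels, and a test point x.
columns = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
labels = ["y", "other", "other"]
x = [0.96, 0.28]
angles = [principal_angle(col, x) for col in columns]
i = first_selected(columns, x)
print(labels[i], [round(a, 3) for a in angles])
```

In this toy example only the class-$y$ column has an angle below 0.3 radians, and it is the column selected first.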

Let us prove by contradiction that $\mathcal{X}_y$ cannot become empty. Suppose that at a certain step $s$, the homotopy 
path deletes an element so that $\mathcal{X}_y$ is empty. Because the first added element makes $\mathcal{X}_y$ non-empty, 
for $\mathcal{X}_y$ to be empty from a certain step $s$ onwards, the deleted element $x_1 \in \mathcal{X}_s$ must be the 
only datum of class $y$, i.e., $x = [x_1, \mathcal{X}_{-y}][\beta_y, \beta_{-y}]' + \epsilon$. 

However, because the principal angle condition guarantees that $\theta(x, \mathcal{X}_{-y}) > c_{xy}$ and $\theta(x, x_1) \leq 
c_{xy}$, deleting $x_1$ increases both $\|\epsilon\|_2$ and $\|\beta\|_1$, and therefore can never attain $\min_\beta \|x - \mathcal{X}\beta\|_2^2/2 + \lambda\|\beta\|_1$ 
for any $\lambda > 0$ (which is the objective function on the homotopy path). Thus if there is only 
one observation of class $y$ remaining in the active set $\mathcal{X}_s$, that datum can never be deleted in the 
homotopy solution path. Hence $\mathcal{X}_y$ is almost surely asymptotically non-empty for $s \geq 1$ under the 
principal angle condition. 

Therefore, given the principal angle condition, with probability converging to 1 we have $\theta(x, \mathcal{X}_y\beta_y) < 
\theta(x, \mathcal{X}_{-y}\beta_{-y})$. Then if the principal angle condition holds with probability $q$ under $F_{XY}$, $P_D$ is 
asymptotically no less than $q$ for $\ell_1$ minimization as $n \to \infty$. □ 

6.4. Corollary 

Proof. Given the principal angle condition, class dominance holds with probability 1 asymptotically. 
So if the condition in Corollary 2 also holds, i.e., class dominance implies positive class dominance so 
that class dominance alone is sufficient for correct classification, we have $g(X) = Y$ with probability 
1 asymptotically. 

Therefore if those two conditions hold with probability $q$, the SRC error satisfies 

$$\begin{aligned}
L &= P_D \times P_1 + (1 - P_D) \times P_2 \\
&\to q \times 0 + (1 - q) \times P_2 \\
&\leq 1 - q.
\end{aligned}$$

Furthermore, if $q \to 1$, SRC is asymptotically optimal with $L \to 0$. □ 
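The bound can be unpacked using only three facts from the argument: $P_1 \to 0$, $P_D \geq q$ asymptotically, and $P_2 \leq 1$. The display below is a sketch of that bookkeeping rather than a restatement of the original derivation.

```latex
\begin{align*}
L &= P_D P_1 + (1 - P_D) P_2 \\
  &\leq P_1 + (1 - P_D) P_2
      && \text{since } P_D \leq 1 \\
  &\leq P_1 + (1 - q)
      && \text{asymptotically, since } P_D \geq q \text{ and } P_2 \leq 1.
\end{align*}
```

Letting $P_1 \to 0$ then leaves $1 - q$ as the asymptotic upper bound on $L$.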

6.5. Corollary 

Proof. Next we consider replacing $\ell_1$ minimization by other subset regression methods. 

When $\ell_1$ homotopy is replaced by OMP, the only difference in the proof in Section 6.3 concerns 
whether $\mathcal{X}_y$ is still non-empty when using OMP. At $s = 1$, OMP adds the same element into $\mathcal{X}_y$ 
as $\ell_1$ homotopy, and the principal angle condition guarantees that the first entered element is of class $y$ 
almost surely. Unlike $\ell_1$ homotopy, OMP never deletes any element on its solution path; thus all 
other parts of the proofs in Sections 6.3 and 6.4 remain the same, and OMP can achieve SRC optimality. 
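The two properties of OMP used here, namely that its first pick is the column most correlated with the testing observation and that it never deletes a selected column, can be seen in a minimal sketch. The implementation below tracks the residual through Gram-Schmidt orthogonalisation instead of solving a least-squares problem at every step; this is an implementation choice made for the sketch, and the function name `omp_select` and the toy data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def omp_select(columns, x, s, tol=1e-12):
    """Greedy OMP selection of up to s column indices (never deletes any).

    Each selected column is orthonormalised against the earlier ones
    (Gram-Schmidt), so the residual is always x minus its projection
    onto the span of the selected columns.
    """
    residual = list(x)
    basis, active = [], []
    for _ in range(s):
        candidates = [i for i in range(len(columns)) if i not in active]
        if not candidates:
            break
        best = max(candidates, key=lambda i: abs(dot(columns[i], residual)))
        q = list(columns[best])
        for b in basis:  # remove components along earlier directions
            coef = dot(b, q)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        q_norm = math.sqrt(dot(q, q))
        if q_norm < tol:  # column adds no new direction; stop
            break
        q = [qi / q_norm for qi in q]
        coef = dot(q, residual)
        residual = [ri - coef * qi for ri, qi in zip(residual, q)]
        basis.append(q)
        active.append(best)
    return active, residual


# Toy unit-norm columns and a test observation, made up for illustration.
columns = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
x = [0.9, 0.3, 0.3]
active, residual = omp_select(columns, x, 2)
print(active)
```

The active list only grows, and its first entry is the column with the largest absolute inner product with `x`.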

When $\ell_1$ homotopy is replaced by marginal regression, the first element to enter $\mathcal{X}_s$ coincides 
with that of $\ell_1$ homotopy and OMP. Therefore the principal angle condition still guarantees that $\mathcal{X}_y$ is 
almost surely non-empty for given $s \geq 1$. However, as marginal regression only picks the $s$ training 
observations that are most correlated with the testing observation, it is possible that $\mathcal{X}_s$ is no longer 
full rank beyond a certain $s$. Thus for the proof in Section 6.3 to work, we need to constrain $s$ so that 
$\mathcal{X}_s$ is full rank. 
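The rank issue is easy to reproduce: marginal regression ranks columns by correlation alone, so nothing prevents it from selecting linearly dependent columns. The sketch below uses a Gram-Schmidt test for linear independence; the helper names `marginal_select` and `has_full_column_rank` and the toy data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def marginal_select(columns, x, s):
    """Indices of the s columns with the largest |x_i' x|."""
    order = sorted(range(len(columns)),
                   key=lambda i: abs(dot(columns[i], x)), reverse=True)
    return order[:s]


def has_full_column_rank(columns, tol=1e-10):
    """Gram-Schmidt test that the given columns are linearly independent."""
    basis = []
    for col in columns:
        q = list(col)
        for b in basis:
            coef = dot(b, q)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        q_norm = math.sqrt(dot(q, q))
        if q_norm < tol:
            return False
        basis.append([qi / q_norm for qi in q])
    return True


# Four unit-norm columns in two dimensions, made up for illustration.
columns = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
x = [0.96, 0.28]
for s in (2, 3):
    picked = marginal_select(columns, x, s)
    print(s, picked, has_full_column_rank([columns[i] for i in picked]))
```

With two-dimensional data the selection is full rank at $s = 2$ but cannot be at $s = 3$, which is the situation the constraint on $s$ is meant to exclude.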

Finally, for full regression, i.e., when we use $X$ directly to derive the regression vector $\beta$ by $\ell_2$ 
minimization, $\mathcal{X}_y$ is almost surely asymptotically non-empty as the prior probability of class $y$ should 
be nonzero. Therefore all arguments in the proofs in Sections 6.3 and 6.4 remain the same, and full regression 
can also achieve SRC optimality, as long as $X$ itself is full rank. □ 

6.6. Corollary 

Proof. As Xi\{Yi =y) = ^+i/e(a:,Xi)<c,,„ + we may treat X_i as from an addi¬ 

tional class AT -I- 1, and keep X+i still from class y. 

Then the extended principal angle condition leads to the same class dominance result of Theo¬ 
rem]^ and Corollary and Corollary easily follow with essentially the same proofs. □ 
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